Opening Questions

What influences happiness?

Happiness over time?

Datasets and pre-processing

Our base dataset is the World Happiness Report for the years 2015 to 2022. The World Happiness Report is a publication of the United Nations Sustainable Development Solutions Network. It contains articles and rankings of national happiness, based on respondents' ratings of their own lives, which the report also correlates with various life factors. The dataset contains 12 columns and 1237 rows; an excerpt is shown in the table below. As our goal is to analyse the factors of happiness, the following six columns are the most important for us: GDP, Family, Health, Freedom, Corruption and Generosity.

| Country | Region | Happiness.Rank | Happiness.Score | Standard.Error | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |

In addition, we wanted to include further factors and therefore added the following three datasets:

By merging the datasets we now have four additional factors: Alcohol, Population, Tobacco and Internet.

To join the datasets we had to do some manual preprocessing, which can be seen in the preprocessing step. The main steps were cleaning the data (region, country code, NaN values) and joining the datasets on the year and the country code.
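
The join on year and country code can be sketched in base R as follows. This is a minimal illustration, not the project's actual preprocessing code: the data frames `happiness` and `alcohol` are small stand-ins, the 2017 score is a placeholder, and using a left join (`all.x = TRUE`) is an assumption.

```r
# Toy stand-ins for the real tables. The 2018 values follow the tables in this
# report; the 2017 row is a placeholder added only to show an unmatched row.
happiness <- data.frame(
  Country.Code = c("FI", "NO", "DK", "FI"),
  Year         = c(2018, 2018, 2018, 2017),
  Happiness    = c(7.632, 7.594, 7.555, 7.5)
)
alcohol <- data.frame(
  Country.Code = c("FI", "NO", "DK"),
  Year         = c(2018, 2018, 2018),
  Alcohol      = c(10.78, 7.41, 10.26)
)

# Left join on country code and year: every happiness row is kept,
# and rows without a match in the second table get NA.
joined <- merge(happiness, alcohol,
                by = c("Country.Code", "Year"),
                all.x = TRUE)
print(joined)
```

Keeping unmatched rows as NA, instead of dropping them with an inner join, makes the gaps in the additional datasets visible before deciding how to handle them.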

After joining, we noticed that the three additional datasets do not contain data for the whole timespan 2015-2022 (fig. “missing values full data”). Therefore, we decided to create two datasets: one for analysing the change in happiness over time and one for analysing the factors that influence happiness in a single year.

For the first dataset, the over-time analysis, we included only the six factors from the base happiness dataset and excluded all rows containing missing values. We also renamed the columns to give them shorter labels.

| Country | Happiness.Rank | Happiness | Economy | Family | Health | Freedom | Trust | Generosity | Year | Region |
|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2015 | Western Europe |
| Iceland | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2015 | Western Europe |
| Denmark | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2015 | Western Europe |

For the second dataset, the influential factors analysis, we inspected the missing values of each year and chose the year with the fewest missing values, 2018 (fig. “missing values 2018”). By contrast, figure “missing values 2017” shows, for example, that the smoking and alcohol datasets did not contain any values for 2017. We then again excluded all rows containing missing values and renamed the columns to give them shorter labels.
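
Choosing the year with the fewest missing values can be sketched as below. This is an illustrative example on made-up data, not the code used for the report: `full_data`, `factor_cols` and all values are hypothetical, and counting missing cells (rather than incomplete rows) is one possible criterion.

```r
# Toy data frame; countries and values are hypothetical.
full_data <- data.frame(
  Country = c("A", "B", "A", "B", "A", "B"),
  Year    = c(2017, 2017, 2018, 2018, 2019, 2019),
  Alcohol = c(NA, NA, 10.8, 7.4, 10.5, NA),
  Tobacco = c(NA, NA, 19.7, NA, NA, NA)
)

factor_cols <- c("Alcohol", "Tobacco")

# Number of missing cells per year across the factor columns.
na_per_year <- sapply(split(full_data[factor_cols], full_data$Year),
                      function(d) sum(is.na(d)))
print(na_per_year)

# Year with the fewest missing cells.
best_year <- as.numeric(names(which.min(na_per_year)))

# Keep that year only, then drop rows that still contain missing values.
data_best <- na.omit(full_data[full_data$Year == best_year, ])
print(data_best)
```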

| Country | Happiness.Rank | Happiness | Economy | Family | Health | Freedom | Trust | Generosity | Year | Region | Country.Code | Code | Alcohol | Population | Tobacco | Internet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Finland | 1 | 7.632 | 1.305 | 1.592 | 0.874 | 0.681 | 0.393 | 0.202 | 2018 | Western Europe | FI | FIN | 10.78 | 5522585 | 19.7 | 88.88996 |
| Norway | 2 | 7.594 | 1.456 | 1.582 | 0.861 | 0.686 | 0.340 | 0.286 | 2018 | Western Europe | NO | NOR | 7.41 | 5337960 | 13.0 | 96.49166 |
| Denmark | 3 | 7.555 | 1.351 | 1.590 | 0.868 | 0.683 | 0.408 | 0.284 | 2018 | Western Europe | DK | DNK | 10.26 | 5752131 | 18.6 | 97.31920 |

missing values full data

missing values 2017

missing values 2018

Preliminary analyses

One of the objectives of preliminary data analysis is to get a feel for the data by describing its key features and summarizing the results. We focus on the second dataset, the influential factors analysis dataset, as it contains the most explanatory variables.

Boxplots, scale data?

First we use the summary to check how the explanatory variables are distributed. As we can see, they are on different scales, especially “Population” and “Internet usage”. Because we do not want the following analyses to be dominated by the variables with the largest distances, we standardise each variable by \(\frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}\).

##    Happiness        Economy           Family          Health      
##  Min.   :2.905   Min.   :0.0760   Min.   :0.372   Min.   :0.0000  
##  1st Qu.:4.486   1st Qu.:0.7040   1st Qu.:1.063   1st Qu.:0.4475  
##  Median :5.483   Median :1.0100   Median :1.314   Median :0.6750  
##  Mean   :5.489   Mean   :0.9335   Mean   :1.247   Mean   :0.6283  
##  3rd Qu.:6.332   3rd Qu.:1.2240   3rd Qu.:1.481   3rd Qu.:0.8180  
##  Max.   :7.632   Max.   :1.5760   Max.   :1.644   Max.   :1.0080  
##     Freedom           Trust          Generosity        Alcohol      
##  Min.   :0.0250   Min.   :0.0000   Min.   :0.0000   Min.   : 0.003  
##  1st Qu.:0.3875   1st Qu.:0.0500   1st Qu.:0.1020   1st Qu.: 3.220  
##  Median :0.5040   Median :0.0880   Median :0.1670   Median : 7.150  
##  Mean   :0.4758   Mean   :0.1195   Mean   :0.1840   Mean   : 6.842  
##  3rd Qu.:0.5835   3rd Qu.:0.1450   3rd Qu.:0.2545   3rd Qu.:10.385  
##  Max.   :0.7240   Max.   :0.4570   Max.   :0.5980   Max.   :15.090  
##    Population           Tobacco         Internet    
##  Min.   :3.367e+05   Min.   : 3.70   Min.   : 4.10  
##  1st Qu.:5.488e+06   1st Qu.:13.90   1st Qu.:37.60  
##  Median :1.444e+07   Median :22.20   Median :68.21  
##  Mean   :6.007e+07   Mean   :22.02   Mean   :60.43  
##  3rd Qu.:4.430e+07   3rd Qu.:27.90   3rd Qu.:82.81  
##  Max.   :1.428e+09   Max.   :45.50   Max.   :99.60
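
The standardisation described above corresponds to a z-score per column. A minimal sketch in base R, assuming a small hypothetical data frame `df` whose values are borrowed from the ranges in the summary, could look like this:

```r
# Toy columns on very different scales; the rows themselves are hypothetical.
df <- data.frame(
  Population = c(3.367e5, 1.444e7, 1.428e9),
  Internet   = c(4.10, 68.21, 99.60)
)

# Manual z-score: subtract the column mean, divide by the sample standard deviation.
z_manual <- as.data.frame(lapply(df, function(x) (x - mean(x)) / sd(x)))

# Base R's scale() does the same with its defaults (center = TRUE, scale = TRUE).
z_scale <- as.data.frame(scale(df))

# Check that both approaches agree and inspect the resulting spread per column.
print(isTRUE(all.equal(unlist(z_manual), unlist(z_scale))))
print(sapply(z_scale, sd))
```

After this step each column has mean 0 and standard deviation 1, so distance-based methods no longer weight Population far more heavily than the other variables.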

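library(ggplot2)  # ggplot(), geom_boxplot(), geom_jitter()
library(plotly)   # ggplotly()

# Boxplot of the happiness score per region, with one jittered point per country
box <- ggplot(data_2018, aes(x = Region, y = Happiness, color = Region)) +
  geom_boxplot() +
  geom_jitter(aes(color = Country), size = 0.5) +
  ggtitle("Happiness Score for Regions and Countries") +
  coord_flip() +                    # horizontal boxes keep the region names readable
  theme(legend.position = "none")   # a legend entry per country would be too long
ggplotly(box)                       # interactive version of the plot
colnames(data_2018)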
##  [1] "Country"        "Happiness.Rank" "Happiness"      "Economy"       
##  [5] "Family"         "Health"         "Freedom"        "Trust"         
##  [9] "Generosity"     "Year"           "Region"         "Country.Code"  
## [13] "Code"           "Alcohol"        "Population"     "Tobacco"       
## [17] "Internet"
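library(GGally)  # provides ggpairs()
# correlation_categories (the column names to correlate) is defined in a chunk not shown here
correlation_data <- data_2018[, correlation_categories]
ggpairs(correlation_data, title = "correlation matrix for influential factors analysis")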

correlation matrices

regression

PCA (Colour by region) + biplot (or PLS)

SOM (2018)

Happiness over time?

geography map (color each country based on the percentage change over time (2015-2022))
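
One way the per-country values for such a map could be computed is sketched below. The data frame `over_time` and its values are hypothetical, and taking the change relative to the 2015 score is an assumption about how "percentage change" is meant here.

```r
# Toy scores; values are hypothetical, column names follow the over-time dataset.
over_time <- data.frame(
  Country   = c("A", "A", "B", "B"),
  Year      = c(2015, 2022, 2015, 2022),
  Happiness = c(5.0, 5.5, 6.0, 5.4)
)

first <- over_time[over_time$Year == 2015, c("Country", "Happiness")]
last  <- over_time[over_time$Year == 2022, c("Country", "Happiness")]

# One row per country with the first and last score side by side.
change <- merge(first, last, by = "Country", suffixes = c(".2015", ".2022"))

# Percentage change relative to the 2015 score.
change$Pct.Change <- 100 * (change$Happiness.2022 - change$Happiness.2015) /
  change$Happiness.2015
print(change)
```

Countries that lack a score in either year drop out of the inner join, so they would appear without a colour on the map.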

What influences happiness

Future work